Abstract In this workflow, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The workflow is aimed at a beginner to intermediate level, such as, for example, seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.
This material is available under a Creative Commons CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.
Before we start:
If you identify typos, or if there are parts that you would like to see expanded or clarified, please let me know by telling me directly (during workshops), opening a GitHub issue or emailing me. Please also briefly specify your background/familiarity with mass spectrometry and/or proteomics (beginner, intermediate or expert) so that I can update the material accordingly.
In recent years, we have seen an increase in the number of R and Bioconductor packages to analyse mass spectrometry and proteomics data, as well as an increase in their total number of downloads. See the vignette Proteomics packages in Bioconductor for more details and the code underlying these figures.
It is also worth highlighting that several of these packages have become group efforts, supported by several developers in the community. This post illustrates the various contributions to MSnbase. mzR has benefited from a similarly wide range of successful contributions. Both packages, and in particular mzR, are used by many others, and will be described in some detail in this workflow.
This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis.
Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contain relevant material.
This workflow provides a general introduction to Bioconductor software for mass spectrometry and proteomics. If you are interested in more specific topics, see the following resources:
- vignette("pRoloc-tutorial", package = "pRoloc") or online.
- vignette("msnid_vignette", package = "MSnID") or online. In addition, the vignettes of the msmsTest package describe how to analyse spectral counting data using packages dedicated to the analysis of high throughput sequencing data.
- vignette("MALDIquant-intro", package = "MALDIquant") and available online.
- vignette("Cardinal-walkthrough", package = "Cardinal") and online.
The following packages will be used throughout this document. R version 3.3.1 or higher is required to install all the packages using BiocInstaller::biocLite.
library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")
The most convenient way to install all the tutorial's requirements (and more related content) is to install RforProteomics with all its dependencies.
library("BiocInstaller")
biocLite("RforProteomics", dependencies = TRUE)
Other packages of interest, such as rTANDEM or MSGFgui will be described later in the document but are not required to execute the code in this workflow.
In Bioconductor version 3.6, there are 92 proteomics and 62 mass spectrometry software packages, and 17 mass spectrometry experiment packages. These packages can be listed with the proteomicsPackages(), massSpectrometryPackages() and massSpectrometryDataPackages() functions and explored interactively, or found by browsing the respective biocViews on the Bioconductor web page.
library("RforProteomics")
pp <- proteomicsPackages()
display(pp)
Most community-driven formats described in the table are supported in R. We will see how to read and access these data in the following sections.
| Type | Format | Package |
|---|---|---|
| raw | mzML, mzXML, netCDF, mzData | MSnbase (read and write in version >= 2.3.13) via mzR |
| identification | mzIdentML | mzID (read) and MSnbase (read, via mzR) |
| quantitation | mzQuantML | |
| peak lists | mgf | MSnbase (read) |
| other | mzTab | MSnbase (read) |
Mass spectrometry (MS) is a technology that separates charged molecules (ions) based on their mass-to-charge ratio (M/Z). It is often coupled to chromatography: liquid chromatography (LC), but it can also be gas chromatography (GC). The time an analyte takes to elute from the chromatography column is its retention time.
A chromatogram, illustrating the total amount of analytes over the retention time.
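A mass spectrometer is composed of three components: the source, which ionises the analytes; the analyser, which separates the ions according to their M/Z; and the detector, which quantifies the ions.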
When using mass spectrometry for proteomics, the proteins are first digested with a protease such as trypsin. In shotgun proteomics, the analytes assayed in the mass spectrometer are peptides.
Often, ions are subjected to more than a single MS round. After a first round of separation, the peaks in the spectra, called MS1 spectra, represent peptides. At this stage, the only information we possess about these peptides is their retention time and their mass-to-charge ratio (we can also infer their charge by inspecting their isotopic envelope, i.e. the peaks of the individual isotopes, see below), which is not enough to infer their identity (i.e. their sequence).
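The charge inference mentioned above can be sketched in a few lines of base R. Consecutive isotope peaks of an ion with charge z are separated by roughly 1.00336 / z on the M/Z axis (1.00336 Da being the mass difference between 13C and 12C). The peak positions below are hypothetical values chosen for illustration, not taken from the data used in this workflow.

```r
## Hypothetical M/Z values of four consecutive isotope peaks (illustration only)
iso_mz <- c(645.374, 645.876, 646.377, 646.879)

## Mass difference between 13C and 12C, in Da
delta_c13 <- 1.00336
## Mass of a proton, in Da
proton <- 1.007276

## Average spacing between consecutive isotope peaks
spacing <- mean(diff(iso_mz))

## The charge is the isotope mass difference divided by the observed spacing
charge <- round(delta_c13 / spacing)
charge
## [1] 2

## Neutral monoisotopic mass estimated from the first (monoisotopic) peak
neutral_mass <- charge * (iso_mz[1] - proton)
```

A spacing of about 0.5 therefore points to a doubly charged ion, and a spacing of about 0.33 to a triply charged one.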
In MSMS (or MS2), the settings of the mass spectrometer are set automatically to select a certain number of MS1 peaks (for example 20). Once a narrow M/Z range has been selected (corresponding to one high-intensity peak, a peptide, and some background noise), it is fragmented (using for example collision-induced dissociation (CID), higher energy collisional dissociation (HCD) or electron-transfer dissociation (ETD)). The fragment ions are then themselves separated in the analyser to produce an MS2 spectrum. The unique fragment ion pattern can then be used to infer the peptide sequence using de novo sequencing (when the spectrum is of high enough quality) or using a search engine such as, for example, Mascot, MSGF+, …, that will match the observed, experimental spectrum to theoretical spectra (see details below).
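The selection of the most intense MS1 peaks for fragmentation can be illustrated with a small base R sketch. The peak list, the number of selected peaks (top_n) and the isolation window width (iso_width) are all made-up values for illustration; real instruments apply additional rules (for example dynamic exclusion) that are not modelled here.

```r
## Hypothetical MS1 peak list (M/Z and intensity), for illustration only
ms1_peaks <- data.frame(
    mz        = c(402.21, 460.79, 488.80, 521.33, 610.45, 702.90),
    intensity = c(2e5, 9e6, 7e6, 4e4, 3e6, 8e5))

top_n     <- 3   # number of precursors to select (assumed setting)
iso_width <- 2   # isolation window width in M/Z units (assumed setting)

## Indices of the top_n most intense peaks
sel <- head(order(ms1_peaks$intensity, decreasing = TRUE), top_n)

## Each selected peak defines a narrow M/Z range that would be fragmented
precursors <- data.frame(
    precursor_mz = ms1_peaks$mz[sel],
    lower        = ms1_peaks$mz[sel] - iso_width / 2,
    upper        = ms1_peaks$mz[sel] + iso_width / 2)
precursors
```

Each row of precursors would give rise to one MS2 spectrum.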
Schematics of a mass spectrometer and two rounds of MS.
The animation below shows how 25 different ions (i.e. having different M/Z values) are separated throughout the MS analysis and are eventually detected (i.e. quantified). The final frame shows the hypothetical spectrum.
Separation and detection of ions in a mass spectrometer.
The figures below illustrate the two rounds of MS. The spectrum on the left is an MS1 spectrum acquired after 21 minutes and 3 seconds of elution. 10 peaks, highlighted by dotted vertical lines, were selected for MS2 analysis. The peak at M/Z 460.79 (488.8) is highlighted by a red (orange) vertical line on the MS1 spectrum and the fragment spectra are shown on the MS2 spectrum on the top (bottom) right figure.
Parent ions in the MS1 spectrum (left) and two selected fragment ion MS2 spectra (right).
The figures below represent the 3 dimensions of MS data: a set of spectra (M/Z and intensity) over retention time, as well as the interleaved nature of MS1 and MS2 (and there could be more levels) data.
MS1 spectra over retention time.
MS2 spectra interleaved between two MS1 spectra.
MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRoteomics IDEntifications (PRIDE) database at the EBI for mass spectrometry-based experiments (including quantitative data, contrary to what the name suggests), PASSEL at the ISB for Selected Reaction Monitoring (SRM, i.e. targeted) data and the MassIVE resource. These data can be downloaded within R using the rpx package.
library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD010394 2018-07-12 07:13:42 New
## 2 PXD009320 2018-07-11 16:43:15 New
## 3 PXD007706 2018-07-11 16:33:36 New
## 4 PXD010332 2018-07-11 16:30:52 New
## 5 PXD008019 2018-07-11 16:29:27 New
## 6 PXD007881 2018-07-11 15:52:37 Updated information
## 7 PXD008005 2018-07-11 15:33:04 New
## 8 PXD009062 2018-07-11 15:21:20 Updated information
## 9 PXD003188 2018-07-11 15:13:06 New
## 10 PXD008783 2018-07-11 15:11:03 New
## 11 PXD008782 2018-07-11 15:07:36 New
## 12 PXD009662 2018-07-11 15:05:42 New
## 13 PXD009509 2018-07-11 15:04:17 Updated information
## 14 PXD008525 2018-07-11 15:03:49 New
## 15 PXD004334 2018-07-11 14:57:29 New
Using the unique PXD000001 identifier, we can retrieve the relevant metadata that will be stored in a PXDataset object. The names of the files available in this data can be retrieved with the pxfiles accessor function.
px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
## Id: PXD000001 with 12 files
## [1] 'F063721.dat' ... [12] 'generated'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "README.txt"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [11] "erwinia_carotovora.fasta"
## [12] "generated"
Other metadata for the px data set:
pxtax(px)
## [1] "Erwinia carotovora"
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
pxref(px)
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta. 2014 1844(1 pt a):42-51"
Data files can then be downloaded with the pxget function. Below, we retrieve the raw data file. The file is downloaded in the working directory (if the file is already available, it is not downloaded a second time), and the name of the file is returned by the function and stored in the mzf variable for later use. This and other files are also available in the msdata package, described below.
fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
## Downloading 1 file
## /home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML already present.
mzf
## [1] "/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
AnnotationHub is a cloud resource set up and managed by the Bioconductor project that serves various omics datasets. It is possible to contribute and access proteomics data, although only a limited number of datasets are currently available.
library("AnnotationHub")
ah <- AnnotationHub()
## snapshotDate(): 2018-04-30
query(ah, "proteomics")
## AnnotationHub with 4 records
## # snapshotDate(): 2018-04-30
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH49006"]]'
##
## title
## AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
## AH49007 | PXD000001: Peptide-level quantitation data
## AH49008 | PXD000001: raw mass spectrometry data
## AH49009 | PXD000001: MS-GF+ identiciation data
ms <- ah[["AH49008"]]
ms
## Mass Spectrometry file handle.
## Filename: 55314
## Number of scans: 7534
The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The file name, 55314, is not very descriptive because the data originates from the AnnotationHub cloud repository. If the data was read from a local file, it would be named as the mzML (or mzXML) file (see below).
Some data are also distributed through dedicated packages. The msdata package, for example, provides some general raw data files relevant for both proteomics and metabolomics.
library("msdata")
## proteomics raw data
proteomics()
## [1] "MRM-standmix-5.mzML.gz"
## [2] "MS3TMT10_01022016_32917-33481.mzML.gz"
## [3] "MS3TMT11.mzML"
## [4] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
## [5] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
## proteomics identification data
ident()
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
More often, such experiment packages distribute processed data; an example of such is the pRolocdata package, that offers quantitative proteomics data.
The MSnbase package provides high-level data abstractions for raw MS data through the MSnExp class and containers for quantification data via the MSnSet class (see Quantitative proteomics section). Both store
- the assay data (spectra or quantitation values), accessed with spectra (or the [, [[ operators) or exprs;
- sample metadata, accessed as a data.frame with pData;
- feature metadata, accessed as a data.frame with fData.
Another useful slot is processingData, accessed with processingData(.), that records all the processing that objects have undergone since their creation (see examples below).
MSnExp class
The readMSData function will parse the raw data and construct an MS experiment object of class MSnExp. An important argument to readMSData is mode, which can be "onDisk" or "inMemory". The former doesn’t load the raw data in memory and is generally the recommended mode; loading the data in memory is not advised for MS1 data or when many files are loaded. See the benchmarking vignette (open it with vignette("benchmarking", package = "MSnbase") or read it online) for details.
library("MSnbase")
## get a small test data
rawFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
               full.names = TRUE, pattern = "mzXML$")
basename(rawFile)
## [1] "dummyiTRAQ.mzXML"
msexp <- readMSData(rawFile, msLevel = 2L)
msexp
## MSn experiment data ("MSnExp")
## Object size in memory: 0.18 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 5
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Thu Jul 12 11:30:37 2018
## MSnbase version: 2.6.1
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: F1.S1 F1.S2 ... F1.S5 (5 total)
## fvarLabels: spectrum
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
Spectra can be extracted as a list of Spectrum2 objects with the spectra accessor or as a subset of the original MSnExp data with the [ operator. Individual spectra can be accessed with [[.
length(msexp)
## [1] 5
msexp[1:2]
## MSn experiment data ("MSnExp")
## Object size in memory: 0.07 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 2
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Thu Jul 12 11:30:37 2018
## Data [numerically] subsetted 2 spectra: Thu Jul 12 11:30:37 2018
## MSnbase version: 2.6.1
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: F1.S1 F1.S2
## fvarLabels: spectrum
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msexp[[2]]
## Object of class "Spectrum2"
## Precursor: 546.9586
## Retention time: 25:2
## Charge: 3
## MSn level: 2
## Peaks count: 1012
## Total ion count: 56758067
We can also extract the chromatogram for the acquisition(s) in the MSnExp object and visualise it. Here, we use a complete acquisition from the msdata package, read it in on-disk mode and focus on MS1 data, which is used to generate chromatograms.
f <- msdata::proteomics(pattern = "45stepped_60min_01-20141210", full.names = TRUE)
rw <- readMSData(f, mode = "onDisk", msLevel. = 1L)
chr <- chromatogram(rw)
chr
## Chromatograms with 1 row and 1 column
## TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz
## <Chromatogram>
## [1,] length: 1431
## phenoData with 1 variables
## featureData with 1 variables
plot(chr)
Note that here, as we only loaded a single raw data file, we obtain a Chromatograms object with a single chromatogram. When reading multiple raw data files at once (for example with readMSData(c("f1.mzML", "f2.mzML"))), we would get and visualise one chromatogram per file.
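To make explicit what such a chromatogram contains, here is a minimal base R sketch that aggregates the intensities of each MS1 spectrum into one value per retention time. The three spectra are toy data; the sum per spectrum corresponds to a total ion chromatogram, which, as far as I am aware, is also the default aggregation used by chromatogram (aggregationFun = "sum"), while the maximum gives a base peak chromatogram.

```r
## Three hypothetical MS1 spectra as M/Z-intensity matrices (toy data)
spectra <- list(
    cbind(mz = c(400.1, 500.2, 600.3), intensity = c(10, 200, 30)),
    cbind(mz = c(400.1, 500.2),        intensity = c(50, 400)),
    cbind(mz = c(500.2, 600.3, 700.4), intensity = c(100, 20, 5)))

## Retention times of the three spectra, in seconds
rt <- c(60.0, 61.5, 63.0)

## Total ion current: sum of all intensities in each spectrum
tic <- vapply(spectra, function(p) sum(p[, "intensity"]), numeric(1))
## Base peak intensity: highest intensity in each spectrum
bpi <- vapply(spectra, function(p) max(p[, "intensity"]), numeric(1))

chrom <- data.frame(rt = rt, tic = tic, bpi = bpi)
chrom
```

The result can be drawn with, for example, plot(chrom$rt, chrom$tic, type = "l").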
The identification results stemming from the same raw data file can then be used to add PSM matches. Here, we use the small msexp test data with 5 MS2 spectra that we read in further up.
## initial feature variable
fData(msexp)
## spectrum
## F1.S1 1
## F1.S2 2
## F1.S3 3
## F1.S4 4
## F1.S5 5
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.names = TRUE, pattern = "dummyiTRAQ.mzid")
basename(identFile)
## [1] "dummyiTRAQ.mzid"
msexp <- addIdentificationData(msexp, identFile)
## additional feature variables
fvarLabels(msexp)
## [1] "spectrum" "acquisition.number"
## [3] "sequence" "chargeState"
## [5] "rank" "passThreshold"
## [7] "experimentalMassToCharge" "calculatedMassToCharge"
## [9] "modNum" "isDecoy"
## [11] "post" "pre"
## [13] "start" "end"
## [15] "DatabaseAccess" "DBseqLength"
## [17] "DatabaseSeq" "DatabaseDescription"
## [19] "idFile" "MS.GF.RawScore"
## [21] "MS.GF.DeNovoScore" "MS.GF.SpecEValue"
## [23] "MS.GF.EValue" "modName"
## [25] "modMass" "modLocation"
## [27] "subOriginalResidue" "subReplacementResidue"
## [29] "subLocation" "nprot"
## [31] "npep.prot" "npsm.prot"
## [33] "npsm.pep"
We see that 3 out of 5 MS2 spectra in the msexp data have been identified; those that haven’t have missing values for the new, id-related feature variables.
fData(msexp)$rank
## [1] 1 1 NA NA 1
fData(msexp)$isDecoy
## [1] FALSE FALSE NA NA FALSE
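The count of identified spectra can be derived directly from these missing values. The sketch below uses a plain vector holding the rank values printed above, so that it runs without the msexp object; with the real data one would use fData(msexp)$rank instead.

```r
## Values of fData(msexp)$rank as printed above
id_rank <- c(1, 1, NA, NA, 1)

## A spectrum is identified if its id-related variables are not missing
identified <- !is.na(id_rank)

sum(identified)     # number of identified spectra
## [1] 3
which(identified)   # their indices
## [1] 1 2 5
```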
Exercise Load all MS level data from file MS3TMT11.mzML available in the msdata package using readMSData, making sure you set mode = "onDisk", and verify which MS levels (accessible with the msLevel function) are centroided (accessible with the centroided() function). See section Raw data processing for data in profile and centroided (processed) modes.
f <- proteomics(full.names = TRUE, pattern = "MS3TMT11.mzML")
ms <- readMSData(f, mode = "onDisk")
table(centroided(ms), msLevel(ms))
##
## 1 2 3
## FALSE 30 0 0
## TRUE 0 482 482
Spectra and (parts of) experiments can be extracted and plotted.
msexp[[1]]
## Object of class "Spectrum2"
## Precursor: 645.3741
## Retention time: 25:1
## Charge: 3
## MSn level: 2
## Peaks count: 2921
## Total ion count: 668170086
plot(msexp[[1]], full=TRUE)
Plotting an object of class Spectrum.
As this data was labeled with iTRAQ4 isobaric tags, we can highlight these four peaks of interest with
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
Plotting an object of class Spectrum with reporter ions.
msexp[1:3]
## MSn experiment data ("MSnExp")
## Object size in memory: 0.11 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 3
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Thu Jul 12 11:30:37 2018
## Data [numerically] subsetted 3 spectra: Thu Jul 12 11:30:39 2018
## MSnbase version: 2.6.1
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: F1.S1 F1.S2 F1.S3
## fvarLabels: spectrum acquisition.number ... npsm.pep (33 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
plot(msexp[1:3], full=TRUE)
Plotting an object of class MSnExp
In the examples above, we only used a single file as input to readMSData, but multiple files can be read into a single MSnExp object. The origin of the spectra can be accessed with the fromFile function:
fromFile(msexp)
## F1.S1 F1.S2 F1.S3 F1.S4 F1.S5
## 1 1 1 1 1
Exercise Repeat the previous combination of raw and identification data with the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz and TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid files from msdata. Retain only MS 2 level data; this can be done either when reading the data in (see the msLevel argument in ?readMSData) or afterwards by filtering the MS levels with filterMsLevel.
## read raw data
rwf <- proteomics(pattern = "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz",
full.names = TRUE)
tmterw <- readMSData(rwf, mode = "onDisk")
## or, only read MS2-level data
## tmterw <- readMSData(rwf, mode = "onDisk", msLevel = 2L)
## add identification data
idf <- ident(full.names = TRUE)
tmterw <- addIdentificationData(tmterw, idf)
tmterw2 <- filterMsLevel(tmterw, 2L)
## It is also possible to chain operations
library("magrittr")
tmterw2 <- rwf %>%
    readMSData(mode = "onDisk") %>%
    addIdentificationData(idf) %>%
    filterMsLevel(2L)
Exercise Still using the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210 data from the previous exercise, identify the index of the MS2 spectrum with the highest precursor intensity (see the precursorIntensity feature variable) and plot it as illustrated above.
i <- which.max(precursorIntensity(tmterw2))
sp <- tmterw2[[i]]
plot(sp, full = TRUE)
As seen in the introduction, scans have a hierarchical structure: MS2 spectra stem from a precursor MS1 scan. This also holds for MS3 spectra, which result from an additional analysis round of MS2 spectra. When validating quantitative or identification data by referring back to raw data, it is often useful to be able to navigate this structure.
We will use an experiment with 3 MS levels to do this:
ms3f <- proteomics(pattern = "MS3TMT11", full.names = TRUE)
basename(ms3f)
## [1] "MS3TMT11.mzML"
ms3 <- readMSData(ms3f, mode = "onDisk")
Note that it is important to use on-disk mode here, as we want to retain all MS levels, which isn’t possible with in-memory mode.
Exercise Compute a table showing how many MS1, 2, and 3 level scans are available in this data
table(msLevel(ms3))
##
## 1 2 3
## 30 482 482
The filterPrecursorScan function takes a raw data object and the acquisition number of a spectrum (get them with acquisitionNum), and returns a new raw data object containing that spectrum and the spectra that originate from it.
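The idea behind this kind of navigation can be sketched in base R on a toy scan table. The column names acquisitionNum, msLevel and precursorScanNum follow the convention of the mzR header; the values and the descendants helper are hypothetical and only collect a scan and its children, which is a simplification and not the MSnbase implementation.

```r
## Toy scan table (hypothetical values): each MS2/MS3 scan points to its parent
scans <- data.frame(
    acquisitionNum   = c(1, 2, 3, 4, 5, 6, 7),
    msLevel          = c(1, 2, 2, 3, 1, 2, 3),
    precursorScanNum = c(NA, 1, 1, 3, NA, 5, 6))

## Hypothetical helper: return a scan and all its direct and indirect children
descendants <- function(scans, acq) {
    keep <- acq
    repeat {
        ## scans whose parent is already in the set of retained scans
        children <- scans$acquisitionNum[scans$precursorScanNum %in% keep]
        new <- setdiff(children, keep)
        if (length(new) == 0) break
        keep <- c(keep, new)
    }
    scans[scans$acquisitionNum %in% keep, ]
}

## MS1 scan 1, its two MS2 scans and the MS3 scan stemming from scan 3
descendants(scans, 1)
```

Starting from scan 5 instead would return scans 5, 6 and 7.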
Exercise Find the acquisition of the first MS1 spectrum and extract all spectra that originate, directly and indirectly, from it.
head(msLevel(ms3))
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006
## 1 2 2 3 2 2
head(acquisitionNum(ms3))
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006
## 21945 21946 21947 21948 21949 21950
(from1 <- filterPrecursorScan(ms3, 21945))
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.05 Mb
## - - - Spectra data - - -
## MS level(s): 1 2 3
## Number of spectra: 35
## MSn retention times: 45:27 - 45:30 minutes
## - - - Processing information - - -
## Data loaded [Thu Jul 12 11:31:01 2018]
## Filter: select parent/children scans for 21945 [Thu Jul 12 11:31:01 2018]
## MSnbase version: 2.6.1
## - - - Meta data - - -
## phenoData
## rowNames: MS3TMT11.mzML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## MS3TMT11.mzML
## protocolData: none
## featureData
## featureNames: F1.S001 F1.S002 ... F1.S035 (35 total)
## fvarLabels: fileIdx spIdx ... spectrum (29 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msLevel(from1)
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006 F1.S007 F1.S008 F1.S009
## 1 2 2 3 2 2 3 2 2
## F1.S010 F1.S011 F1.S012 F1.S013 F1.S014 F1.S015 F1.S016 F1.S017 F1.S018
## 3 3 2 3 2 2 2 3 2
## F1.S019 F1.S020 F1.S021 F1.S022 F1.S023 F1.S024 F1.S025 F1.S026 F1.S027
## 2 3 2 2 2 3 2 2 3
## F1.S028 F1.S029 F1.S030 F1.S031 F1.S032 F1.S033 F1.S034 F1.S035
## 3 3 3 3 3 3 3 3
This section illustrates the underlying infrastructure from the mzR package, which is used by MSnbase under the hood. It is recommended to use the high-level interface, as it supports multiple files and performs data integrity checks throughout data processing.
The mzR package provides an interface to the ProteoWizard C/C++ code base to access various raw data files, such as mzML, mzXML, netCDF, and mzData. The data is accessed on-disk, i.e. it is not loaded entirely in memory, but only when explicitly requested. The three main functions are openMSfile to create a file handle to a raw data file, header to extract metadata about the spectra contained in the file and peaks to extract one or multiple spectra of interest. Other functions such as instrumentInfo or runInfo can be used to gather general information about a run.
Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.
library("mzR")
basename(mzf)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
The object loaded from AnnotationHub in the previous section is of the same type, and was also created by the openMSfile function. All operations below can equally be applied to it.
The header function returns the metadata of all available spectra:
hd <- header(ms)
dim(hd)
## [1] 7534 25
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum" "injectionTime"
## [23] "filterString" "spectrumId"
## [25] "centroided"
We can extract metadata and scan data for scan 1000 as follows:
hd[1000, ]
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000 1000 1000 2 1 274 1048554
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000 1106.916 136.061 164464 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 1000 0 104.5467 1370.758 992 683.0817
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000 2 689443.7 0 0
## mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000 0 0 55.21463
## filterString
## 1000 FTMS + p NSI d Full ms2 683.08@hcd45.00 [100.00-1380.00]
## spectrumId centroided
## 1000 controllerType=0 controllerNumber=1 scan=1000 TRUE
head(peaks(ms, 1000))
## [,1] [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975
plot(peaks(ms, 1000), type = "h", xlab = "M/Z", ylab = "Intensity")
Manual extraction and plotting of an MS spectrum
See also this short video.
Below, we illustrate some additional visualisations and animations of raw MS data, taken from the RforProteomics visualisation vignette. On the left, we have a heatmap visualisation of an MS map and a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of 10 interleaved MS2 spectra.
## (1) Open raw data file
ms <- openMSfile(mzf)
## (2) Extract the header information
hd <- header(ms)
## (3) MS1 spectra indices
ms1 <- which(hd$msLevel == 1)
## (4) Select MS1 spectra with retention time between 30 and 35 minutes
rtsel <- hd$retentionTime[ms1] / 60 > 30 & hd$retentionTime[ms1] / 60 < 35
## (5) Indices of the 1st and 2nd MS1 spectra after 30 minutes
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
## (6) Interleaved MS2 spectra
ms2 <- (i+1):(j-1)
## (1) MS space heatmap
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
## 1
ff <- colorRampPalette(c("yellow", "steelblue"))
lattice::trellis.par.set(regions=list(col=ff(100)))
m1 <- plot(M, aspect = 1, allTicks = FALSE)
## (2) Same data as (1), in 3 dimensions
M@map[msMap(M) == 0] <- NA
m2 <- plot3D(M, rgl = FALSE)
## (3) The 2 MS1 and 10 interleaved MS2 spectra from above
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
M2 <- MSmap(ms, i:j, 100, 1000, 1, hd)
## 1
m3 <- plot3D(M2)
gridExtra::grid.arrange(m1, m2, m3, ncol = 3)
Plotting MS maps along retention time, MZ range and intensity.
Below, we have animations built from extracting successive slices as above.
Let’s use the identification data from msdata:
idf <- msdata::ident(full.names = TRUE)
basename(idf)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
The easiest way to read identification data in mzIdentML (often abbreviated with mzid) into R is to read it with readMzIdData, that will parse it, process it, and return a data.frame:
iddf <- readMzIdData(idf)
head(iddf)
## sequence spectrumID
## 1 RQCRTDFLNYLR controllerType=0 controllerNumber=1 scan=2949
## 2 ESVALADQVTCVDWRNRKATKK controllerType=0 controllerNumber=1 scan=6534
## 3 KELLCLAMQIIR controllerType=0 controllerNumber=1 scan=5674
## chargeState rank passThreshold experimentalMassToCharge
## 1 3 1 TRUE 548.2856
## 2 2 1 TRUE 1288.1528
## 3 2 1 TRUE 744.4109
## calculatedMassToCharge modNum isDecoy post pre start end DatabaseAccess
## 1 547.9474 1 FALSE V R 574 585 ECA2006
## 2 1288.1741 1 FALSE G R 69 90 ECA1676
## 3 744.4255 1 TRUE Q R 131 142 XXX_ECA2855
## DBseqLength DatabaseSeq
## 1 1295
## 2 110
## 3 157
## DatabaseDescription
## 1 ECA2006 ATP-dependent helicase
## 2 ECA1676 putative growth inhibitory protein
## 3
## acquisitionNum
## 1 2949
## 2 6534
## 3 5674
## spectrumFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## idFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue
## 1 10 101 4.617121e-08 0.1321981
## 2 12 121 7.255875e-08 0.2087481
## 3 8 74 9.341019e-08 0.2674533
## MS.GF.QValue MS.GF.PepQValue modName modMass modLocation
## 1 0.5254237 0.5490196 Carbamidomethyl 57.02146 3
## 2 0.6103896 0.6231884 Carbamidomethyl 57.02146 11
## 3 0.6250000 0.6363636 Carbamidomethyl 57.02146 5
## subOriginalResidue subReplacementResidue subLocation
## 1 <NA> <NA> NA
## 2 <NA> <NA> NA
## 3 <NA> <NA> NA
## [ reached getOption("max.print") -- omitted 3 rows ]
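As a sanity check on the columns above, the calculatedMassToCharge of the first PSM can be approximated by hand from its sequence, charge and modification. The sketch below assumes standard monoisotopic residue masses (typed in here as constants, not read from any package) and the usual relation M/Z = (M + z × proton mass) / z; it is an illustration and not how MS-GF+ or mzR compute the value.

```r
## Monoisotopic residue masses (Da) for the amino acids of this peptide
## (standard values, entered manually for this illustration)
residue <- c(R = 156.101111, Q = 128.058578, C = 103.009185,
             T = 101.047679, D = 115.026943, F = 147.068414,
             L = 113.084064, N = 114.042927, Y = 163.063329)
water  <- 18.010565          # added once per peptide (termini)
proton <- 1.007276           # mass of a proton
carbamidomethyl <- 57.02146  # modMass reported above, on the cysteine

pep <- "RQCRTDFLNYLR"        # sequence of the first PSM
z   <- 3                     # its chargeState

## Neutral peptide mass: residues + water + the single modification
aa <- strsplit(pep, "")[[1]]
neutral_mass <- sum(residue[aa]) + carbamidomethyl + water

## Mass-to-charge ratio of the z-times protonated peptide
mz <- (neutral_mass + z * proton) / z
mz
```

The result should land very close to the calculatedMassToCharge of 547.9474 reported for this PSM.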
When adding identification data with the addIdentificationData function as shown above, the data is first read with readMzIdData, and is then cleaned up by removing decoy hits, PSMs of rank greater than 1 and non-proteotypic peptides:
## at this stage, we still have all the PSMs
table(iddf$isDecoy)
##
## FALSE TRUE
## 2906 2896
table(iddf$rank)
##
## 1 2 3 4
## 5487 302 12 1
Exercise This behaviour can be replicated with the filterIdentificationDataFrame function. Try it out for yourself.
iddf2 <- filterIdentificationDataFrame(iddf)
table(iddf2$isDecoy)
##
## FALSE
## 2710
table(iddf2$rank)
##
## 1
## 2710
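The effect of this clean-up can be mimicked step by step in base R on a toy PSM table. The data below are invented, and the three steps (drop decoys, keep rank 1, drop spectra matching more than one protein) reflect my understanding of what the function does rather than its actual implementation; the column names are the ones shown in the readMzIdData output above.

```r
## Toy PSM table (hypothetical values)
psm <- data.frame(
    spectrumID     = c("scan=1", "scan=1", "scan=2", "scan=3", "scan=4", "scan=4"),
    sequence       = c("PEPTIDEK", "PEPTLDEK", "AAGGR", "LLNNK", "SSTTK", "SSTTK"),
    rank           = c(1, 2, 1, 1, 1, 1),
    isDecoy        = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
    DatabaseAccess = c("P1", "P2", "XXX_P3", "P4", "P5", "P6"),
    stringsAsFactors = FALSE)

## 1. remove PSMs matching the decoy database
x <- psm[!psm$isDecoy, ]

## 2. keep only the best-ranking PSM(s) of each spectrum
x <- x[x$rank == 1, ]

## 3. remove spectra that match more than one protein accession
n_prot <- tapply(x$DatabaseAccess, x$spectrumID,
                 function(acc) length(unique(acc)))
x <- x[as.vector(n_prot[x$spectrumID]) == 1, ]
x
```

Only the PSMs for scan=1 (rank 1) and scan=3 survive the three filters in this toy example.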
Exercise Data wrangling with identification data; the standard tidyverse tools are fit for purpose here. Extract the PSMs and their scores as described above and combine them. From the available data, calculate the length of each peptide (you can use nchar with the peptide sequence sequence) and the number of peptides for each protein (DatabaseDescription). Plot the length of the proteins against their respective number of peptides. Optionally, stratify the plot by the peptide e-value score (MS.GF.EValue) using for example cut to define bins.
suppressPackageStartupMessages(library("dplyr"))
iddf2 <- as_tibble(iddf2) %>%
mutate(peplen = nchar(sequence))
npeps <- iddf2 %>%
group_by(DatabaseDescription) %>%
tally
iddf2 <- full_join(iddf2, npeps)
## Joining, by = "DatabaseDescription"
library("ggplot2")
ggplot(iddf2, aes(x = n, y = DBseqLength)) + geom_point()
Identification data wrangling 1
iddf2$evalBins <- cut(iddf2$MS.GF.EValue, summary(iddf2$MS.GF.EValue))
ggplot(iddf2, aes(x = n, y = DBseqLength, color = peplen)) +
geom_point() +
facet_wrap(~ evalBins)
Along the lines of what is available for raw data, the parsing of this XML-based format is provided by mzR. A file handle to mzIdentML files can be created with the openIDfile function. As for raw data, the underlying C/C++ code comes from ProteoWizard.
library("mzR")
id1 <- openIDfile(idf)
id1
## Identification file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## Number of psms: 5759
Various data can be extracted from the identification object. The peptide spectrum matches (PSMs) and the identification scores can be accessed as a data.frame with psms and score respectively.
softwareInfo(id1)
## [1] "MS-GF+ Beta (v10072) "
## [2] "ProteoWizard MzIdentML 3.0.501 ProteoWizard"
enzymes(id1)
## name nTermGain cTermGain minDistance missedCleavages
## 1 Trypsin 0 1000
fid1 <- mzR::psms(id1)
head(fid1)
## spectrumID chargeState rank
## 1 controllerType=0 controllerNumber=1 scan=5782 3 1
## 2 controllerType=0 controllerNumber=1 scan=6037 3 1
## 3 controllerType=0 controllerNumber=1 scan=5235 3 1
## 4 controllerType=0 controllerNumber=1 scan=5397 3 1
## 5 controllerType=0 controllerNumber=1 scan=6075 3 1
## passThreshold experimentalMassToCharge calculatedMassToCharge
## 1 TRUE 1080.2325 1080.2321
## 2 TRUE 1002.2089 1002.2115
## 3 TRUE 1189.2836 1189.2800
## 4 TRUE 960.5365 960.5365
## 5 TRUE 1264.3409 1264.3419
## sequence modNum isDecoy post pre start end
## 1 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR 0 FALSE S R 50 84
## 2 TQVLDGLINANDIEVPVALIDGEIDVLR 0 FALSE R K 288 315
## 3 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR 0 FALSE A R 192 224
## 4 SQILQQAGTSVLSQANQVPQTVLSLLR 0 FALSE - R 264 290
## 5 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR 0 FALSE F R 119 153
## DatabaseAccess DBseqLength DatabaseSeq
## 1 ECA1932 155
## 2 ECA1147 434
## 3 ECA0013 295
## 4 ECA1731 290
## 5 ECA1443 298
## DatabaseDescription acquisitionNum
## 1 ECA1932 outer membrane lipoprotein 5782
## 2 ECA1147 trigger factor 6037
## 3 ECA0013 ribose-binding periplasmic protein 5235
## 4 ECA1731 flagellin 5397
## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase 6075
## [ reached getOption("max.print") -- omitted 1 row ]
sc1 <- mzR::score(id1)
head(sc1)
## spectrumID MS.GF.RawScore
## 1 controllerType=0 controllerNumber=1 scan=5782 147
## 2 controllerType=0 controllerNumber=1 scan=6037 214
## 3 controllerType=0 controllerNumber=1 scan=5235 211
## 4 controllerType=0 controllerNumber=1 scan=5397 154
## 5 controllerType=0 controllerNumber=1 scan=6075 188
## 6 controllerType=0 controllerNumber=1 scan=5761 123
## MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue MS.GF.QValue
## 1 174 3.764831e-27 1.086033e-20 0
## 2 245 6.902626e-26 1.988774e-19 0
## 3 264 1.778789e-25 5.129649e-19 0
## 4 178 1.792541e-24 5.163566e-18 0
## 5 252 1.510364e-23 4.356914e-17 0
## 6 138 1.618941e-23 4.658952e-17 0
## MS.GF.PepQValue
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
The mzID package has similar functionality to parse identification files, and was the first one to provide such capabilities in R. The main difference with mzR is that it parses the files using the XML package and reads the whole data into memory rather than relying on ProteoWizard, and is slower.
While searches are generally performed using third-party software independently of R, or can be started from R using a system call, the MSGFplus package makes it possible to perform a search using the MS-GF+ engine directly, as illustrated below.
We search the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML file against the fasta file from PXD000001 using MSGFplus.
We first download the fasta file from ProteomeXchange (that file is also available in the proteomics lab directory on the CSAMA workshop server).
fas <- pxget(px, "erwinia_carotovora.fasta")
## Downloading 1 file
basename(fas)
## [1] "erwinia_carotovora.fasta"
Below, we set up and run the search. Note that in the runMSGF call, the memory allocated to the Java virtual machine is limited to 1GB. In general, there is no need to specify this argument, unless you experience an error regarding the maximum heap size.
library("MSGFplus")
msgfpar <- msgfPar(database = fas,
instrument = 'HighRes',
tda = TRUE,
enzyme = 'Trypsin',
protocol = 'iTRAQ')
idres <- runMSGF(msgfpar, mzf, memory=1000)
## '/usr/bin/java' -Xmx1000M -jar '/home/lg390/R/x86_64-pc-linux-gnu-library/3.4/MSGFplus/MSGFPlus/MSGFPlus.jar' -s '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML' -o '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid' -d '/home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta' -tda 1 -inst 1 -e 1 -protocol 2
##
## reading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid... DONE!
idres
## An mzID object
##
## Software used: MS-GF+ (version: Beta (v10072))
##
## Rawfile: /home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
##
## Database: /home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta
##
## Number of scans: 5343
## Number of PSM's: 5656
## identification file (needed below)
basename(mzID::files(idres)$id)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
A graphical interface to perform the search and explore the results is also available:
library("MSGFgui")
MSGFgui()
The MSGFgui interface
The rTANDEM package can be used to perform a search with the X!Tandem software.
The MSnID package can be used for post-search filtering of MS/MS identifications. One starts with the construction of an MSnID object that is populated with identification results that can be imported from a data.frame or from mzIdentML files. Here, we will use the example identification data provided with the package.
mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
basename(mzids)
## [1] "c_elegans.mzid.gz"
We start by loading the package, initialising the MSnID object, and adding the identification results from our mzid file (there could of course be more than one).
library("MSnID")
msnid <- MSnID(".")
## Note, the anticipated/suggested columns in the
## peptide-to-spectrum matching results are:
## -----------------------------------------------
## accession
## calculatedMassToCharge
## chargeState
## experimentalMassToCharge
## isDecoy
## peptide
## spectrumFile
## spectrumID
msnid <- read_mzIDs(msnid, mzids)
## Loaded cached data
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 12263 at 36 % FDR
## #peptides: 9489 at 44 % FDR
## #accessions: 7414 at 76 % FDR
Printing the MSnID object returns some basic information, such as the working directory, the number of spectrum files, and the number of PSMs, peptides and accessions with their respective FDR.
The package then makes it possible to define, optimise and apply filters based, for example, on missed cleavages, identification scores or precursor mass errors, and to assess PSM, peptide and protein FDR levels. To function properly, it expects to have access to the following data
## [1] "accession" "calculatedMassToCharge"
## [3] "chargeState" "experimentalMassToCharge"
## [5] "isDecoy" "peptide"
## [7] "spectrumFile" "spectrumID"
which are indeed present in our data:
names(msnid)
## [1] "spectrumID" "scan number(s)"
## [3] "acquisitionNum" "passThreshold"
## [5] "rank" "calculatedMassToCharge"
## [7] "experimentalMassToCharge" "chargeState"
## [9] "MS-GF:DeNovoScore" "MS-GF:EValue"
## [11] "MS-GF:PepQValue" "MS-GF:QValue"
## [13] "MS-GF:RawScore" "MS-GF:SpecEValue"
## [15] "AssumedDissociationMethod" "IsotopeError"
## [17] "isDecoy" "post"
## [19] "pre" "end"
## [21] "start" "accession"
## [23] "length" "description"
## [25] "pepSeq" "modified"
## [27] "modification" "idFile"
## [29] "spectrumFile" "databaseFile"
## [31] "peptide"
Here, we summarise a few steps and redirect the reader to the package’s vignette for more details:
Cleaning irregular cleavages at the termini of the peptides and missed cleavage sites within the peptide sequences. The following two function calls create the new numIrregCleavages and numMissCleavages columns in the MSnID object.
msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
Now, we can use the apply_filter function to effectively apply filters. The strings passed to the function represent expressions that will be evaluated, thus keeping only PSMs that have 0 irregular cleavages and 2 or fewer missed cleavages.
msnid <- apply_filter(msnid, "numIrregCleavages == 0")
msnid <- apply_filter(msnid, "numMissCleavages <= 2")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 7838 at 17 % FDR
## #peptides: 5598 at 23 % FDR
## #accessions: 3759 at 53 % FDR
Using "calculatedMassToCharge" and "experimentalMassToCharge", the mass_measurement_error function calculates the parent ion mass measurement error in parts per million.
summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2184.0640 -0.6992 0.0000 17.6146 0.7512 2012.5178
We then filter out any matches that do not fit the +/- 20 ppm tolerance.
msnid <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -19.7797 -0.5866 0.0000 -0.2970 0.5713 19.6758
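As a worked example of the quantity being filtered on, the sketch below computes a parts-per-million error by hand from experimental and calculated m/z values (the first three PSMs shown earlier) and applies the same 20 ppm tolerance. It assumes the usual definition, the difference relative to the calculated value scaled by one million; the MSnID function may apply additional corrections.

```r
## Experimental and calculated m/z for three PSMs (taken from the
## psms() output shown earlier)
exp_mz  <- c(1080.2325, 1002.2089, 1189.2836)
calc_mz <- c(1080.2321, 1002.2115, 1189.2800)

## mass measurement error in parts per million (assumed definition)
ppm <- (exp_mz - calc_mz) / calc_mz * 1e6
round(ppm, 2)

## PSMs within the +/- 20 ppm tolerance
keep <- abs(ppm) < 20
keep
```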
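Filtering of the identification data will rely on two derived variables: the -log10 transformed MS-GF+ spectrum E-value and the absolute parent ion mass measurement error in ppm.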
msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
MS2 filters are handled by dedicated MSnIDFilter class objects, where each individual filter is defined by a name (which must be present in names(msnid)), a comparison operator (>, <, =, …) defining whether we retain hits with values higher or lower than the threshold, and finally the threshold value itself.
filtObj <- MSnIDFilter(msnid)
filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
filtObj$msmsScore <- list(comparison=">", threshold=10.0)
show(filtObj)
## MSnIDFilter object
## (absParentMassErrorPPM < 10) & (msmsScore > 10)
We can then evaluate the filter on the identification data object, which returns the false discovery rate and number of retained identifications for the filtering criteria at hand.
evaluate_filter(msnid, filtObj)
## fdr n
## PSM 0 3807
## peptide 0 2455
## accession 0 1009
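To illustrate what evaluating a filter amounts to, the following base R sketch applies a score threshold to simulated target and decoy hits and reports the resulting FDR and number of retained hits. It assumes the FDR is estimated as the number of decoys divided by the number of targets passing the filter, which is one common target-decoy estimate; the data and the fdr_at helper are hypothetical, and the MSnID documentation should be consulted for its exact definition.

```r
set.seed(1)
## Simulated scores: targets tend to score higher than decoys
toy <- data.frame(
    msmsScore = c(rnorm(200, mean = 12, sd = 3),
                  rnorm(50,  mean = 6,  sd = 2)),
    isDecoy   = rep(c(FALSE, TRUE), times = c(200, 50)))

## Hypothetical helper: FDR and number of hits above a score threshold
fdr_at <- function(x, threshold) {
    sel <- x$msmsScore > threshold
    n_decoy  <- sum(x$isDecoy[sel])
    n_target <- sum(!x$isDecoy[sel])
    c(fdr = if (n_target > 0) n_decoy / n_target else NaN,
      n = sum(sel))
}

fdr_at(toy, threshold = 10)
```

Raising the threshold generally lowers the estimated FDR at the cost of fewer retained identifications, which is the trade-off that filter optimisation explores.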
Rather than setting filtering values by hand, as shown above, these can be set automatically to meet a specific false discovery rate.
filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
method="Grid", level="peptide",
n.iter=500)
show(filtObj.grid)
## MSnIDFilter object
## (absParentMassErrorPPM < 3) & (msmsScore > 7.4)
evaluate_filter(msnid, filtObj.grid)
## fdr n
## PSM 0.004097561 5146
## peptide 0.006447651 3278
## accession 0.021996616 1208
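The idea behind a grid optimisation can be sketched in a few lines of base R: try every combination of thresholds on a grid, discard those that exceed the maximum FDR, and keep the combination that retains the most target hits. This is only a simplified illustration on simulated data, with the same assumed decoy/target FDR estimate as before; it is not the optimize_filter implementation.

```r
set.seed(2)
n_t <- 300; n_d <- 100
## Simulated hits: targets have higher scores and smaller mass errors
toy <- data.frame(
    score   = c(rnorm(n_t, 12, 3), rnorm(n_d, 6, 2)),
    ppm     = c(abs(rnorm(n_t, 0, 2)), runif(n_d, 0, 20)),
    isDecoy = rep(c(FALSE, TRUE), times = c(n_t, n_d)))

## All combinations of candidate thresholds
grid <- expand.grid(score = seq(4, 16, by = 0.5),
                    ppm   = seq(1, 20, by = 1))

## FDR (decoys / targets) and number of targets for each combination
res <- t(apply(grid, 1, function(g) {
    sel <- toy$score > g[["score"]] & toy$ppm < g[["ppm"]]
    nt <- sum(sel & !toy$isDecoy)
    nd <- sum(sel & toy$isDecoy)
    c(fdr = if (nt > 0) nd / nt else 1, n = nt)
}))

## Keep combinations at or below 1% FDR, then maximise the target count
ok   <- res[, "fdr"] <= 0.01
best <- cbind(grid, res)[ok, ]
best <- best[which.max(best$n), ]
best
```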
Filters can eventually be applied (rather than just evaluated) using the apply_filter function.
msnid <- apply_filter(msnid, filtObj.grid)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 5146 at 0.41 % FDR
## #peptides: 3278 at 0.64 % FDR
## #accessions: 1208 at 2.2 % FDR
And finally, identifications that matched decoy and contaminant protein sequences are removed
msnid <- apply_filter(msnid, "isDecoy == FALSE")
msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 5117 at 0 % FDR
## #peptides: 3251 at 0 % FDR
## #accessions: 1179 at 0 % FDR
The resulting filtered identification data can be exported to a data.frame or to a dedicated MSnSet data structure for quantitative MS data, described below, and further processed and analysed using appropriate statistical tests.
Annotated spectra and comparing spectra.
par(mfrow = c(1, 2))
data(itraqdata)
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE) ## centroiding
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))
Annotating and comparing MS2 spectra.
The annotation of spectra is obtained by simulating fragmentation of a peptide and matching observed peaks to fragments:
calculateFragments("SIGFEGDSIGR")
## mz ion type pos z seq
## 1 88.03931 b1 b 1 1 S
## 2 201.12337 b2 b 2 1 SI
## 3 258.14483 b3 b 3 1 SIG
## 4 405.21324 b4 b 4 1 SIGF
## 5 534.25583 b5 b 5 1 SIGFE
## 6 591.27729 b6 b 6 1 SIGFEG
## 7 706.30423 b7 b 7 1 SIGFEGD
## 8 793.33626 b8 b 8 1 SIGFEGDS
## 9 906.42032 b9 b 9 1 SIGFEGDSI
## 10 963.44178 b10 b 10 1 SIGFEGDSIG
## 11 175.11895 y1 y 1 1 R
## 12 232.14041 y2 y 2 1 GR
## 13 345.22447 y3 y 3 1 IGR
## 14 432.25650 y4 y 4 1 SIGR
## 15 547.28344 y5 y 5 1 DSIGR
## 16 604.30490 y6 y 6 1 GDSIGR
## [ reached getOption("max.print") -- omitted 16 rows ]
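The b and y ion values above can be reproduced by hand, which helps to see where they come from. The sketch below sums monoisotopic residue masses from the N-terminus (b ions, plus one proton) and from the C-terminus (y ions, plus water and one proton) for singly charged fragments. The mass table only covers the residues of this peptide, and fragment_mz is a hypothetical helper; calculateFragments additionally reports other ion types, neutral losses and modifications, which are not handled here.

```r
## Monoisotopic residue masses (Da) for the amino acids in the example
aa <- c(G = 57.021464, S = 87.032028, D = 115.026943,
        E = 129.042593, I = 113.084064, F = 147.068414,
        R = 156.101111)
proton <- 1.007276
water  <- 18.010565

## Hypothetical helper: singly charged b and y ions of a peptide
fragment_mz <- function(sequence) {
    res <- aa[strsplit(sequence, "")[[1]]]
    n <- length(res)
    b <- cumsum(res)[-n] + proton               ## N-terminal fragments
    y <- cumsum(rev(res))[-n] + water + proton  ## C-terminal fragments
    data.frame(mz  = unname(c(b, y)),
               ion = c(paste0("b", seq_len(n - 1)),
                       paste0("y", seq_len(n - 1))))
}

frags <- fragment_mz("SIGFEGDSIGR")
head(frags, 3)
```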
Visualising a pair of spectra means that we can access them, and that, in addition to plotting, we can manipulate them and perform computations. The two spectra corresponding to the IMIDLDGTENK peptide, for example have 22 common peaks, a correlation of 0.198 and a dot product of 0.21 (see ?compareSpectra for details).
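As an illustration of such computations, the sketch below bins two made-up centroided spectra on a common m/z grid, counts the bins in which both have signal, and computes a normalised dot product. The bin width of 1, the toy peaks and the bin_spectrum and ndot helpers are assumptions; compareSpectra offers its own binning and similarity options.

```r
## Two toy centroided spectra (m/z and intensity); values are made up
s1 <- data.frame(mz = c(100.1, 150.2, 200.3, 250.4),
                 int = c(10, 50, 20, 5))
s2 <- data.frame(mz = c(100.1, 150.2, 220.0, 250.4),
                 int = c(12, 40, 30, 0.5))

## Sum intensities falling into each m/z bin
bin_spectrum <- function(s, breaks) {
    idx <- cut(s$mz, breaks, labels = FALSE)
    out <- numeric(length(breaks) - 1)
    agg <- tapply(s$int, idx, sum)
    out[as.integer(names(agg))] <- agg
    out
}

## Normalised dot product of two binned intensity vectors
ndot <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

breaks <- seq(100, 300, by = 1)
x <- bin_spectrum(s1, breaks)
y <- bin_spectrum(s2, breaks)

common  <- sum(x > 0 & y > 0)   ## bins with signal in both spectra
dotprod <- ndot(x, y)
c(common = common, dotprod = dotprod)
```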
There is a wide range of proteomics quantitation techniques that can broadly be classified as labelled vs. label-free, depending on whether the features are labelled prior to the MS acquisition, and by the MS level at which quantitation is inferred, namely MS1 or MS2.
| | Label-free | Labelled |
|---|---|---|
| MS1 | XIC | SILAC, 15N |
| MS2 | Counting | iTRAQ, TMT |
In terms of raw data quantitation, most efforts have been devoted to MS2-level quantitation. Label-free XIC quantitation has however been addressed in the frame of metabolomics data processing by the xcms infrastructure.
Below is a list of suggested packages for some common proteomics quantitation technologies:
An MSnExp is converted to an MSnSet by the quantify method. Below, we use the iTRAQ 4-plex isobaric tagging strategy (defined by the iTRAQ4 parameter; other tags are available: see ?ReporterIons) and the max method, which uses the maximum of the reporter peak for quantitation.
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
MS2 spectrum and its iTRAQ4 reporter ions.
msset <- quantify(msexp, method = "max", reporters = iTRAQ4)
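Conceptually, the max method looks for the highest intensity in a small m/z window around each reporter ion. The base R sketch below mimics this on a made-up spectrum, using the iTRAQ4 reporter m/z values shown in the sample metadata; the window half-width of 0.05 and the peak values are assumptions for illustration only.

```r
## iTRAQ 4-plex reporter m/z values
reporters <- c(iTRAQ4.114 = 114.1112, iTRAQ4.115 = 115.1083,
               iTRAQ4.116 = 116.1116, iTRAQ4.117 = 117.1150)
width <- 0.05  ## assumed half-window around each reporter m/z

## Toy MS2 data points in the reporter region (made-up values)
mz  <- c(114.10, 114.11, 114.12, 115.10, 115.11,
         116.11, 116.12, 117.11, 117.12, 120.00)
int <- c(200, 900, 300, 150, 700,
         650, 400, 500, 820, 9999)

## Maximum intensity within each reporter window (NA if no signal)
quant_max <- vapply(reporters, function(r) {
    sel <- abs(mz - r) <= width
    if (any(sel)) max(int[sel]) else NA_real_
}, numeric(1))
quant_max
```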
The figure below gives a schematic of an MSnSet instance and the relation between the assay data and the respective feature and sample metadata, accessible respectively with the exprs, fData and pData functions.
MSnSet structure
New columns can be added to the metadata slots.
exprs(msset)
## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## F1.S1 706555.7 685055.1 929016.1 668245.2
## F1.S2 260663.7 212745.0 163782.8 239142.7
## F1.S3 2213566.0 2069209.6 2204032.2 2331846.8
## F1.S4 616043.4 705976.6 671828.8 666845.6
## F1.S5 1736128.2 1787622.5 1795311.8 1825523.0
pData(msset)
## mz reporters
## iTRAQ4.114 114.1112 iTRAQ4
## iTRAQ4.115 115.1083 iTRAQ4
## iTRAQ4.116 116.1116 iTRAQ4
## iTRAQ4.117 117.1150 iTRAQ4
pData(msset)$groups <- rep(c("Treat", "Cond"), each = 2)
pData(msset)
## mz reporters groups
## iTRAQ4.114 114.1112 iTRAQ4 Treat
## iTRAQ4.115 115.1083 iTRAQ4 Treat
## iTRAQ4.116 116.1116 iTRAQ4 Cond
## iTRAQ4.117 117.1150 iTRAQ4 Cond
Another useful slot is processingData, accessed with processingData(.), that records all the processing that objects have undergone since their creation.
processingData(msset)
## - - - Processing information - - -
## Data loaded: Thu Jul 12 11:30:37 2018
## iTRAQ4 quantification by max: Thu Jul 12 11:31:11 2018
## MSnbase version: 2.6.1
See also: the isobar package supports quantitation from centroided mgf peak lists or its own tab-separated files that can be generated from Mascot and Phenyx vendor files.
Other MS2 quantitation methods available in quantify include the (normalised) spectral index SI, the (normalised) spectral abundance factor SAF, or a simple count method. The code below is for illustration only: it doesn’t make much sense to perform any of these quantitations on such multiplexed data.
exprs(si <- quantify(msexp, method = "SIn"))
## dummyiTRAQ.mzXML
## ECA0510 0.0006553518
## ECA0984 0.0035384487
## ECA1028 0.0002684726
exprs(saf <- quantify(msexp, method = "NSAF"))
## dummyiTRAQ.mzXML
## ECA0510 0.4306167
## ECA0984 0.3094475
## ECA1028 0.2599359
Note that spectra that have not been assigned any peptide (NA) or that match non-unique peptides (npsm > 1) are discarded in the counting process.
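For reference, the normalised spectral abundance factor can be computed by hand: each protein's spectral count is divided by its length (SAF), and the result is scaled so that all values sum to one (NSAF). The counts and lengths below are made up, and this sketch follows the commonly used definition rather than reproducing the exact quantify code.

```r
## Made-up spectral counts and protein lengths
spc <- c(ECA0510 = 12,  ECA0984 = 30,  ECA1028 = 8)
len <- c(ECA0510 = 250, ECA0984 = 400, ECA1028 = 160)

saf  <- spc / len        ## spectral abundance factor
nsaf <- saf / sum(saf)   ## normalised so that values sum to 1
round(nsaf, 3)
```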
As shown above, the MSnID package makes it possible to explore and assess the confidence of identification data using mzid files. A subset of all peptide-spectrum matches that pass a specific false discovery rate threshold can then be converted to an MSnSet, where the number of peptide occurrences is used to populate the assay data.
MzTab files
The Proteomics Standard Initiative (PSI) mzTab file format is aimed at providing a simpler (than XML formats) and more accessible file format to the wider community. It is composed of a key-value metadata section and peptide/protein/small molecule tabular sections. These data can be imported with the readMzTabData function. Here, we specify version 0.9 (which generates the warning) to fit with the version of that file. For recent files, the version argument should be omitted to use the importer for the current file version 1.0.
mztf <- pxget(px, "F063721.dat-mztab.txt")
## Downloading 1 file
(mzt <- readMzTabData(mztf, what = "PEP", version = "0.9"))
## Warning: Version 0.9 is deprecated. Please see '?readMzTabData' and '?
## MzTab' for details.
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: sub[1] sub[2] ... sub[6] (6 total)
## varLabels: abundance
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 1528 (1528 total)
## fvarLabels: sequence accession ... uri (14 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## mzTab read: Mon Jun 19 23:10:32 2017
## MSnbase version: 2.3.6
The mzTab file is also available in the proteomics lab directory on the CSAMA workshop server.
It is also possible to import arbitrary spreadsheets (such as those exported by MaxQuant, ProteomeDiscoverer, …) as MSnSet objects into R with the readMSnSet2 function. The two main arguments of the function are (1) a text-based spreadsheet and (2) the column names or indices that identify the quantitation data. The latter can be queried with the getEcols function.
csv <- dir(system.file("extdata", package = "pRolocdata"),
full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
## [1] "\"Protein ID\"" "\"FBgn\""
## [3] "\"Flybase Symbol\"" "\"No. peptide IDs\""
## [5] "\"Mascot score\"" "\"No. peptides quantified\""
## [7] "\"area 114\"" "\"area 115\""
## [9] "\"area 116\"" "\"area 117\""
## [11] "\"PLS-DA classification\"" "\"Peptide sequence\""
## [13] "\"Precursor ion mass\"" "\"Precursor ion charge\""
## [15] "\"pd.2013\"" "\"pd.markers\""
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
## area.114 area.115 area.116 area.117
## 1 0.379000 0.281000 0.225000 0.114000
## 2 0.420000 0.209667 0.206111 0.163889
## 3 0.187333 0.167333 0.169667 0.476000
## 4 0.247500 0.253000 0.320000 0.179000
## 5 0.216000 0.183000 0.342000 0.259000
## 6 0.072000 0.212333 0.573000 0.142667
head(fData(res))
## Protein.ID FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1 CG10060 FBgn0001104 G-ialpha65A 3 179.86
## 2 CG10067 FBgn0000044 Act57B 5 222.40
## 3 CG10077 FBgn0035720 CG10077 5 219.65
## 4 CG10079 FBgn0003731 Egfr 2 86.39
## 5 CG10106 FBgn0029506 Tsp42Ee 1 52.10
## 6 CG10130 FBgn0010638 Sec61beta 2 79.90
## No..peptides.quantified PLS.DA.classification Peptide.sequence
## 1 1 PM
## 2 9 PM
## 3 3
## 4 2 PM
## 5 1 GGVFDTIQK
## 6 3 ER/Golgi
## Precursor.ion.mass Precursor.ion.charge pd.2013 pd.markers
## 1 PM unknown
## 2 PM unknown
## 3 unknown unknown
## 4 PM unknown
## 5 626.887 2 Phenotype 1 unknown
## 6 ER/Golgi ER
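What happens during such an import can be sketched in base R: the spreadsheet is read as a table, the expression columns are separated from the remaining annotation columns, and both parts share the same feature names. The inline two-row CSV below is a cut-down, partly made-up excerpt used only for illustration; readMSnSet2 additionally builds the MSnSet object and its metadata.

```r
## A small inline spreadsheet (cut-down excerpt, for illustration only)
txt <- 'Protein ID,Mascot score,area 114,area 115,area 116,area 117
CG10060,179.86,0.379,0.281,0.225,0.114
CG10067,222.40,0.420,0.209667,0.206111,0.163889'

tab <- read.csv(text = txt, stringsAsFactors = FALSE)

## Identify the expression columns by name
ecols <- grep("^area", names(tab))

e  <- as.matrix(tab[, ecols])   ## quantitative (assay) data
fd <- tab[, -ecols]             ## feature annotations
rownames(e) <- rownames(fd) <- tab[[1]]

e
fd
```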
Exercise Using readMSnSet2, load the following file that was part of the supplementary information of a manuscript.
csvfile <- dir(system.file("extdata", package = "pRolocdata"),
pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv",
full.names = TRUE)
basename(csvfile)
## [1] "hyperLOPIT-SIData-ms3-rep12-intersect.csv.gz"
You’ll first need to identify which columns to use as expression data. In this case however, two rows are used as header, and you’ll need to set n in getEcols to retrieve the appropriate one. There are 20 expression columns annotated as TMT 10 plex reporter ion M/Z values (if you don’t know these, you can find them out by looking at the TMT10 reporter ion object). You can now use readMSnSet2, remembering to skip 1 line and, optionally, use the first column as feature names (see the fnames argument). What are the number of features and samples in the data?
getEcols(csvfile, split = ",", n = 2)
## [1] ""
## [2] ""
## [3] ""
## [4] "Experiment 1"
## [5] "Experiment 2"
## [6] "Experiment 1"
## [7] "Experiment 2"
## [8] "126"
## [9] "127N"
## [10] "127C"
## [11] "128N"
## [12] "128C"
## [13] "129N"
## [14] "129C"
## [15] "130N"
## [16] "130C"
## [17] "131"
## [18] "126"
## [19] "127N"
## [20] "127C"
## [21] "128N"
## [22] "128C"
## [23] "129N"
## [24] "129C"
## [25] "130N"
## [26] "130C"
## [27] "131"
## [28] "phenoDisco Input"
## [29] "phenoDisco Output"
## [30] "Curated phenoDisco Output"
## [31] "SVM marker set"
## [32] "SVM classification"
## [33] "SVM score"
## [34] "SVM classification (top quartile)"
## [35] "Final Localization Assignment"
## [36] "First localization evidence?"
## [37] "Curated Organelles"
## [38] "Cytoskeletal Components"
## [39] "Trafficking Proteins"
## [40] "Protein Complexes"
## [41] "Signaling Cascades"
## [42] "Oct4 Interactome"
## [43] "Nanog Interactome"
## [44] "Sox2 Interactome"
## [45] "Cell Surface Proteins"
hl <- readMSnSet2(csvfile, ecol = 8:27, fnames = 1, skip = 1)
dim(hl) ## number of features (rows) and samples (columns)
For raw data processing look at MSnbase’s clean, smooth, pickPeaks, removePeaks and trimMz for MSnExp and spectra processing methods.
As an illustration, we show the pickPeaks function on the itraqdata data. Centroiding transforms the distribution of M/Z values measured for an ion (i.e. a set of M/Z and intensities, first figure below) into a single M/Z and intensity pair of values (second figure below).
library("ggplot2") ## for coord_cartesian
data(itraqdata)
plot(itraqdata[[10]], full = TRUE) +
coord_cartesian(xlim = c(915, 925))
Peak picking: profile mode.
itraqdata2 <- pickPeaks(itraqdata)
plot(itraqdata2[[10]], full = TRUE, w1 = 0.05) +
coord_cartesian(xlim = c(915, 925))
Peak picking: centroided.
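The principle of centroiding can be shown on simulated profile data: each peak, measured as a distribution of points, is reduced to the m/z and intensity of its local maximum. The sketch below is deliberately naive (a fixed noise threshold and simple local maxima on two simulated Gaussian peaks); pickPeaks relies on more robust noise estimation and peak detection.

```r
## Simulated profile-mode signal: two Gaussian-shaped peaks
mz  <- seq(915, 925, by = 0.01)
int <- 1000 * dnorm(mz, mean = 918, sd = 0.05) +
        400 * dnorm(mz, mean = 922, sd = 0.05)

## A point is a local maximum when the slope changes from rising to
## falling; an arbitrary threshold of 1 excludes the baseline
is_peak <- c(FALSE, diff(sign(diff(int))) == -2, FALSE) & int > 1

centroids <- data.frame(mz = mz[is_peak], int = int[is_peak])
centroids
```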
The MALDIquant and xcms packages also feature a wide range of raw data processing methods on their own ad hoc data instance types.
Each type of quantitative data will require its own pre-processing and normalisation steps. Both isobar and MSnbase make it possible to correct for isobaric tag impurities and to normalise the quantitative data.
data(itraqdata)
qnt <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
impurities <- matrix(c(0.929, 0.059, 0.002, 0.000,
0.020, 0.923, 0.056, 0.001,
0.000, 0.030, 0.924, 0.045,
0.000, 0.001, 0.040, 0.923),
nrow = 4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt <- purityCorrect(qnt, impurities)
processingData(qnt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jul 12 11:31:14 2018
## Purity corrected: Thu Jul 12 11:31:14 2018
## MSnbase version: 1.1.22
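Impurity correction boils down to solving a small linear system. The sketch below assumes that row i of the impurity matrix describes how the signal of tag i is spread over the four channels, so that the observed intensities are the transposed matrix times the true ones; the true intensities are then recovered with solve. Both this orientation and the intensities are assumptions for illustration, and purityCorrect may organise the computation differently.

```r
impurities <- matrix(c(0.929, 0.059, 0.002, 0.000,
                       0.020, 0.923, 0.056, 0.001,
                       0.000, 0.030, 0.924, 0.045,
                       0.000, 0.001, 0.040, 0.923),
                     nrow = 4, byrow = TRUE)

## Made-up "true" reporter intensities for one spectrum
true_int <- c(1000, 500, 2000, 800)

## Assumed model: observed = t(impurities) %*% true
observed <- as.vector(t(impurities) %*% true_int)

## Correction: solve the linear system for the true intensities
corrected <- solve(t(impurities), observed)
round(cbind(observed, corrected), 1)
```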
Various normalisation methods can be applied to MSnSet instances using the normalise method: variance stabilisation (vsn), quantile (quantiles), median or mean centring (center.median or center.mean), …
qnt <- normalise(qnt, "quantiles")
processingData(qnt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jul 12 11:31:14 2018
## Purity corrected: Thu Jul 12 11:31:14 2018
## Normalised (quantiles): Thu Jul 12 11:31:14 2018
## MSnbase version: 1.1.22
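To give an intuition of what quantile normalisation does, the sketch below applies it by hand to a small made-up matrix: the values of each sample are replaced by the mean of the values that share the same rank across samples, so that all samples end up with the same distribution. It ignores ties and missing values, which dedicated implementations handle more carefully.

```r
## Made-up intensities: 4 features (rows) in 3 samples (columns)
x <- matrix(c(5, 2,   3, 4,
              4, 1,   6, 2,
              3, 4.5, 6, 8), ncol = 3,
            dimnames = list(paste0("F", 1:4), paste0("S", 1:3)))

## Reference distribution: mean of the sorted values across samples
ref <- rowMeans(apply(x, 2, sort))

## Replace each value by the reference value of the same rank
xn <- apply(x, 2, function(col) ref[rank(col, ties.method = "first")])
dimnames(xn) <- dimnames(x)
xn
```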
The combineFeatures method combines spectra/peptides quantitation values into protein data. The grouping is defined by the groupBy parameter, which is generally taken from the feature metadata (protein accessions, for example).
gb <- fData(qnt)$ProteinDescription
prt <- combineFeatures(qnt, groupBy = gb, fun = "median")
## Your data contains missing values. Please read the relevant
## section in the combineFeatures manual page for details the effects
## of missing values on data aggregation.
processingData(prt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jul 12 11:31:14 2018
## Purity corrected: Thu Jul 12 11:31:14 2018
## Normalised (quantiles): Thu Jul 12 11:31:14 2018
## Combined 55 features into 39 using median: Thu Jul 12 11:31:14 2018
## MSnbase version: 2.6.1
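The aggregation itself is easy to reproduce on a toy example: rows that share the same group label are summarised, sample by sample, with the chosen function (here the median). The peptide matrix and protein labels below are made up; combineFeatures also updates the feature metadata and processing log, which this sketch does not attempt.

```r
## Made-up peptide-level intensities: 4 peptides in 3 samples
pep <- matrix(c(10, 12, 11,
                20, 22, 21,
                30, 28, 29,
                 5,  6,  7), ncol = 3, byrow = TRUE,
              dimnames = list(paste0("pep", 1:4), paste0("S", 1:3)))

## Protein each peptide belongs to (the grouping variable)
prot <- c("ProtA", "ProtA", "ProtA", "ProtB")

## Median of each sample within each protein group
agg <- t(sapply(split(seq_len(nrow(pep)), prot),
                function(i) apply(pep[i, , drop = FALSE], 2, median)))
agg
```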
Finally, proteomics data analysis is generally hampered by missing values. Missing data imputation is a sensitive operation whose success will be guided by many factors, such as degree and (non-)random nature of the missingness.
Below, we load an MSnSet with missing values and count the number of missing and non-missing values.
data(naset)
table(is.na(naset))
##
## FALSE TRUE
## 10254 770
The naplot figure will reorder cells within the data matrix so that the experiments and features with many missing values will be grouped towards the top and right of the heatmap, and barplots at the top and right summarise the number of missing values in the respective samples (columns) and features (rows).
naplot(naset)
Overview of missing values.
The importance of missing values in a dataset will depend on the quantitation technology employed. Label-free quantitation in particular can suffer from a very high number of missing values.
Missing values in MSnSet instances can be filtered out with the filterNA function. By default, it removes features that contain at least one NA value.
## remove features with missing values
tmp <- filterNA(naset)
processingData(tmp)
## - - - Processing information - - -
## Subset [689,16][301,16] Thu Jul 12 11:31:15 2018
## Removed features with more than 0 NAs: Thu Jul 12 11:31:15 2018
## Dropped featureData's levels Thu Jul 12 11:31:15 2018
## MSnbase version: 1.15.6
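The same kind of filtering can be expressed directly on a matrix, which makes the criterion explicit: compute the proportion of missing values per feature and keep the rows at or below a chosen maximum. The toy matrix and the 0.5 cut-off of the lenient variant are assumptions for illustration.

```r
## Made-up data: 4 features (rows) in 3 samples, with missing values
x <- matrix(c( 1, NA,  3,
               4,  5,  6,
              NA, NA,  9,
               7,  8, NA), ncol = 3, byrow = TRUE,
            dimnames = list(paste0("F", 1:4), paste0("S", 1:3)))

## Proportion of missing values per feature
p_na <- rowMeans(is.na(x))

strict  <- x[p_na == 0, , drop = FALSE]    ## no missing value allowed
lenient <- x[p_na <= 0.5, , drop = FALSE]  ## at most half missing
rownames(strict)
rownames(lenient)
```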
It is of course possible to impute missing values (?impute). This is however not straightforward, as imputation is likely to fail dramatically when a high proportion of data is missing (10s of %). Note that when using limma, for instance, downstream analyses can handle missing values. Still, it is recommended to explore missingness as part of the exploratory data analysis. In addition, there are two types of mechanisms resulting in missing values in LC/MSMS experiments.
Missing values resulting from absence of detection of a feature, despite ions being present at detectable concentrations, for example in the case of ion suppression or as a result of the stochastic, data-dependent nature of the MS acquisition method. These missing values are expected to be randomly distributed in the data and are defined as missing at random (MAR) or missing completely at random (MCAR).
Biologically relevant missing values, resulting from the absence or the low abundance of ions (below the limit of detection of the instrument). These missing values are not expected to be randomly distributed in the data and are defined as missing not at random (MNAR).
Random and non-random missing values.
Different imputation methods are more appropriate to different classes of missing values (as documented in this paper). Values missing at random, and those missing not at random should be imputed with different methods.
Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better. (See here for details)
Generally, it is recommended to use hot deck methods (nearest neighbour (left), maximum likelihood, …) when data are missing at random. Conversely, MNAR features should ideally be imputed with a left-censor (minimum value (right), but not zero, …) method.
## impute missing values using knn imputation
tmp <- impute(naset, method = "knn")
## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 12 rows with more than 50 % entries missing;
## mean imputation used for these rows
processingData(tmp)
## - - - Processing information - - -
## Data imputation using knn Thu Jul 12 11:31:15 2018
## Using default parameters
## MSnbase version: 1.15.6
There are various methods to perform data imputation, as described in ?impute.
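As a contrast to the nearest neighbour approach, a left-censored strategy can be sketched in base R: missing values in each sample are replaced by a small value derived from the observed data, here the 1% quantile of that sample. This is inspired by minimum-value methods such as MinDet, but the simulated data and the choice of quantile are assumptions, and the sketch should not be taken as that method's exact implementation.

```r
set.seed(3)
## Simulated intensities: 10 features in 4 samples
x <- matrix(rlnorm(40, meanlog = 10), ncol = 4,
            dimnames = list(paste0("F", 1:10), paste0("S", 1:4)))
## Introduce a few missing values
x[cbind(c(2, 5, 9), c(1, 3, 4))] <- NA

## Replace the NAs of each sample by a low quantile of its observed values
imputed <- apply(x, 2, function(col) {
    col[is.na(col)] <- quantile(col, probs = 0.01, na.rm = TRUE)
    col
})

table(is.na(x))
table(is.na(imputed))
```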
R in general and Bioconductor in particular are well suited for the statistical analysis of quantitative proteomics data. Several packages provide dedicated resources for proteomics data:
MSstats: A set of tools for statistical relative protein significance analysis in Data dependent (DDA), SRM and Data independent acquisition (DIA) experiments. Data stored in data.frame or MSnSet objects can be used as input.
msmsTests: Statistical tests for label-free LC-MS/MS data by spectral counts, to discover differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial of the edgeR package. All can be readily applied on MSnSet instances produced, for example by MSnID.
isobar also provides dedicated infrastructure for the statistical analysis of isobaric data.
DEP provides an integrated analysis workflow for the analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment.
The MLInterfaces package provides a unified interface to a wide range of machine learning algorithms. Initially developed for microarray and ExpressionSet instances, the pRoloc package enables application of these algorithms to MSnSet data.
Dimensionality reduction is very frequently used to summarise high-dimensional data. Below we will use principal component analysis (PCA), but other methods can be applied. We will use the plot2D function from the pRoloc package, which will extract the expression values in the assay data, perform dimensionality reduction, and produce the scatter plot. While pRoloc was originally developed with the analysis of spatial/organelle proteomics data in mind, it is applicable to many use cases.
Let’s first use plot2D to visualise the pattern in 20 protein quantitation values (initial 20 dimensional data). Here, we use an example from spatial proteomics, where the quantitative protein profiles reflect the proteins’ sub-cellular localisation (from Christoforou et al, 2016, see also Breckels et al, 2016 for more data analysis background). We will use the known localisation of some proteins (marker proteins) to annotate the plot (using the fcol argument).
library("pRoloc")
library("pRolocdata")
data(hyperLOPIT2015)
plot2D(hyperLOPIT2015, fcol = "markers")
addLegend(hyperLOPIT2015, fcol = "markers", cex = .7)
PCA plot for protein sub-cellular localisation.
In other cases, we want to visualise the relation of samples. plot2D uses the rows of the data to perform dimensionality reduction. To use the columns, we just need to transpose the MSnSet. By doing so, the pData becomes the fData and vice versa.
Let’s use a time-course experiment on stem cells (Mulvey et al. 2015). Below, we use the times (time points) variable to set colours and rep (replicate numbers) to set the plotting characters.
data(mulvey2015)
head(pData(mulvey2015))
## rep times cond
## rep1_0hr 1 1 1
## rep1_16hr 1 2 1
## rep1_24hr 1 3 1
## rep1_48hr 1 4 1
## rep1_72hr 1 5 1
## rep1_XEN 1 6 1
plot2D(t(mulvey2015), fcol = "times", fpch = "rep", cex = 2)
addLegend(t(mulvey2015), fcol = "times")
PCA plots for samples in a time-course experiment.
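The effect of transposing can be illustrated with a plain matrix: PCA treats rows as observations, so transposing a proteins-by-samples matrix yields one point per sample rather than one per protein. The simulated matrix and the sample names below are hypothetical, and base R's `prcomp` stands in for the dimensionality reduction.

```r
## Hypothetical proteins x samples matrix: 100 proteins, 6 samples
set.seed(2)
samples <- paste0("rep", rep(1:2, each = 3), "_t", rep(1:3, times = 2))
e <- matrix(rnorm(100 * 6), nrow = 100,
            dimnames = list(paste0("prot", 1:100), samples))

## Rows as observations: one set of scores per protein
pca_proteins <- prcomp(e)
dim(pca_proteins$x)

## Transposed: one set of scores per sample
pca_samples <- prcomp(t(e))
dim(pca_samples$x)
rownames(pca_samples$x)
```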
The example below uses knn with the 5 closest neighbours and the MLInterfaces package as an illustration to classify proteins of unknown sub-cellular localisation to one of 9 possible organelles.
library("MLInterfaces")
library("pRolocdata")
data(dunkley2006)
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
ans
## MLInterfaces classification output container
## The call was:
## MLearn(formula = markers ~ ., data = t(dunkley2006), .method = knnI(k = 5),
## trainInd = traininds)
## Predicted outcome distribution for test set:
##
## ER lumen ER membrane Golgi Mitochondrion Plastid
## 5 140 67 51 29
## PM Ribosome TGN vacuole
## 89 31 6 10
## Summary of scores on test set (use testScores() method for details):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 1.0000 1.0000 0.9332 1.0000 1.0000
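The same idea (train on marker proteins, predict the class of the unknown ones from their 5 nearest neighbours) can be sketched without MLInterfaces, using `knn` from the recommended `class` package. The simulated profiles, the three class names, and the choice of which proteins are left unlabelled are all hypothetical.

```r
library("class")  ## recommended package distributed with R

## Hypothetical data: 90 proteins x 8 fractions, three classes (A, B, C).
## The mean vector is recycled down the columns, so each row has one mean.
set.seed(3)
cl <- rep(c("A", "B", "C"), each = 30)
x <- matrix(rnorm(90 * 8, mean = rep(c(0, 3, 6), each = 30)), nrow = 90)
rownames(x) <- paste0("prot", seq_len(nrow(x)))

## Keep two thirds of the proteins as markers, label the rest "unknown"
markers <- cl
unknown <- seq(3, 90, by = 3)
markers[unknown] <- "unknown"
traininds <- which(markers != "unknown")

## Classify the unknown proteins using their 5 nearest marker proteins
pred <- knn(train = x[traininds, ], test = x[-traininds, ],
            cl = factor(markers[traininds]), k = 5, prob = TRUE)
table(pred)
summary(attr(pred, "prob"))  ## proportion of votes for the winning class
```

The `prob` attribute plays a role comparable to the scores summarised above: values well below 1 flag proteins whose neighbours disagree on the class.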
A wide range of classification and clustering algorithms is also available, as described in the ?MLearn documentation page. The example below uses k-means clustering.
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
kcl
## clusteringOutput: partition table
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 55 28 29 18 60 30 44 145 78 116 33 53
## The call that created this object was:
## MLearn(formula = ~., data = dunkley2006, .method = kmeansI, centers = 12)
plot(kcl, exprs(dunkley2006))
Kmeans clustering using MLInterfaces with an MSnSet object.
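For comparison, an unsupervised clustering of this kind can be sketched in base R with `kmeans`. The simulated matrix, the number of centres, and the projection onto the first two principal components for display are assumptions for illustration, not a description of what the MLInterfaces plot method draws.

```r
## Hypothetical data: 120 proteins x 5 fractions, three well separated groups
set.seed(4)
x <- rbind(matrix(rnorm(40 * 5, mean = 0), ncol = 5),
           matrix(rnorm(40 * 5, mean = 4), ncol = 5),
           matrix(rnorm(40 * 5, mean = 8), ncol = 5))

## k-means with 3 centres; several random starts to avoid a poor local optimum
km <- kmeans(x, centers = 3, nstart = 10)
table(km$cluster)

## Display the partition on the first two principal components
pcs <- prcomp(x)$x[, 1:2]
plot(pcs, col = km$cluster, pch = 19)
```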
All the Bioconductor annotation infrastructure, such as biomaRt, GO.db, organism-specific annotations, etc., is directly relevant to the analysis of proteomics data. A total of 191 ontologies, including some proteomics-centred annotations such as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI 2.5) or Protein Modifications, are available through the rols package.
library("rols")
res <- OlsSearch(q = "ESI", ontology = "MS", exact = TRUE)
res
## Object of class 'OlsSearch':
## ontolgy: MS
## query: ESI
## requested: 20 (out of 1)
## response(s): 0
There is a single exact match (by default, up to 20 results are requested). At this stage, no response has been retrieved yet; the result can be retrieved with olsSearch and coerced to a Terms or data.frame object with
res <- olsSearch(res)
as(res, "Terms")
## Object of class 'Terms' with 1 entries
## From the MS ontology
## MS:1000073
as(res, "data.frame")
## id
## 1 ms:class:http://purl.obolibrary.org/obo/MS_1000073
## iri short_form obo_id
## 1 http://purl.obolibrary.org/obo/MS_1000073 MS_1000073 MS:1000073
## label
## 1 electrospray ionization
## description
## 1 A process in which ionized species in the gas phase are produced from an analyte-containing solution via highly charged fine droplets, by means of spraying the solution from a narrow-bore needle tip at atmospheric pressure in the presence of a high electric field. When a pressurized gas is used to aid in the formation of a stable spray, the term pneumatically assisted electrospray ionization is used. The term ion spray is not recommended.
## ontology_name ontology_prefix type is_defining_ontology
## 1 ms MS class TRUE
Data from the Human Protein Atlas is available via the hpar package.
The best place to ask questions about MS-based proteomics and relevant Bioconductor packages is the Bioconductor support forum. Tagging your question with Proteomics or specific package names will alert the respective maintainers.
sessionInfo()
## R version 3.5.0 Patched (2018-05-14 r74725)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_3.0.0 bindrcpp_0.2.2 dplyr_0.7.6
## [4] msdata_0.20.0 knitr_1.20 hpar_1.22.2
## [7] rols_2.8.1 MSGFplus_1.14.0 pRolocdata_1.18.0
## [10] rpx_1.16.0 MSnID_1.14.0 mzID_1.18.0
## [13] AnnotationHub_2.12.0 pRoloc_1.20.1 MLInterfaces_1.60.1
## [16] cluster_2.0.7-1 annotate_1.58.0 XML_3.98-1.11
## [19] AnnotationDbi_1.42.1 IRanges_2.14.10 S4Vectors_0.18.3
## [22] gridExtra_2.3 lattice_0.20-35 RforProteomics_1.18.1
## [25] BiocInstaller_1.30.0 MSnbase_2.6.1 ProtGenerics_1.12.0
## [28] BiocParallel_1.14.2 mzR_2.14.0 Rcpp_0.12.17
## [31] Biobase_2.40.0 BiocGenerics_0.26.0 BiocStyle_2.8.2
##
## loaded via a namespace (and not attached):
## [1] R.utils_2.6.0 RUnit_0.4.32
## [3] tidyselect_0.2.4 RSQLite_2.1.1
## [5] htmlwidgets_1.2 grid_3.5.0
## [7] trimcluster_0.1-2 lpSolve_5.6.13
## [9] rda_1.0.2-2 munsell_0.5.0
## [11] codetools_0.2-15 preprocessCore_1.42.0
## [13] DT_0.4 withr_2.1.2
## [15] colorspace_1.3-2 Category_2.46.0
## [17] highr_0.7 geometry_0.3-6
## [19] robustbase_0.93-1 dimRed_0.1.0
## [21] labeling_0.3 mnormt_1.5-5
## [23] hwriter_1.3.2 bit64_0.9-7
## [25] ggvis_0.4.3 rprojroot_1.3-2
## [27] ipred_0.9-6 randomForest_4.6-14
## [29] diptest_0.75-7 R6_2.2.2
## [31] doParallel_1.0.11 gridSVG_1.6-0
## [33] flexmix_2.3-14 DRR_0.0.3
## [35] bitops_1.0-6 assertthat_0.2.0
## [37] promises_1.0.1 scales_0.5.0
## [39] nnet_7.3-12 gtable_0.2.0
## [41] affy_1.58.0 biocViews_1.48.2
## [43] ddalpha_1.3.4 timeDate_3043.102
## [45] rlang_0.2.1 CVST_0.2-2
## [47] genefilter_1.62.0 RcppRoll_0.3.0
## [49] splines_3.5.0 lazyeval_0.2.1
## [51] ModelMetrics_1.1.0 impute_1.54.0
## [53] hexbin_1.27.2 broom_0.4.5
## [55] yaml_2.1.19 reshape2_1.4.3
## [57] abind_1.4-5 threejs_0.3.1
## [59] crosstalk_1.0.0 backports_1.1.2
## [61] httpuv_1.4.4.2 RBGL_1.56.0
## [63] caret_6.0-80 tools_3.5.0
## [65] lava_1.6.2 psych_1.8.4
## [67] gplots_3.0.1 affyio_1.50.0
## [69] RColorBrewer_1.1-2 proxy_0.4-22
## [71] plyr_1.8.4 base64enc_0.1-3
## [73] progress_1.2.0 zlibbioc_1.26.0
## [75] purrr_0.2.5 RCurl_1.95-4.10
## [77] prettyunits_1.0.2 rpart_4.1-13
## [79] viridis_0.5.1 sampling_2.8
## [81] sfsmisc_1.1-2 magrittr_1.5
## [83] data.table_1.11.4 pcaMethods_1.72.0
## [85] mvtnorm_1.0-8 whisker_0.3-2
## [87] R.cache_0.13.0 hms_0.4.2
## [89] mime_0.5 evaluate_0.10.1
## [91] xtable_1.8-2 mclust_5.4.1
## [93] compiler_3.5.0 biomaRt_2.36.1
## [95] tibble_1.4.2 KernSmooth_2.23-15
## [97] crayon_1.3.4 R.oo_1.22.0
## [99] htmltools_0.3.6 later_0.7.3
## [101] tidyr_0.8.1 lubridate_1.7.4
## [103] DBI_1.0.0 magic_1.5-8
## [105] MASS_7.3-50 fpc_2.1-11
## [107] Matrix_1.2-14 vsn_3.48.1
## [109] R.methodsS3_1.7.1 gdata_2.18.0
## [111] mlbench_2.1-1 bindr_0.1.1
## [113] gower_0.1.2 igraph_1.2.1
## [115] pkgconfig_2.0.1 foreign_0.8-70
## [117] recipes_0.1.3 MALDIquant_1.18
## [119] xml2_1.2.0 foreach_1.4.4
## [121] prodlim_2018.04.18 stringr_1.3.1
## [123] digest_0.6.15 pls_2.6-0
## [125] graph_1.58.0 rmarkdown_1.10
## [127] dendextend_1.8.0 GSEABase_1.42.0
## [129] curl_3.2 kernlab_0.9-26
## [131] shiny_1.1.0 gtools_3.8.1
## [133] modeltools_0.2-21 nlme_3.1-137
## [135] jsonlite_1.5 interactiveDisplay_1.18.0
## [137] viridisLite_0.3.0 limma_3.36.2
## [139] pillar_1.2.3 httr_1.3.1
## [141] DEoptimR_1.0-8 survival_2.42-4
## [143] interactiveDisplayBase_1.18.0 glue_1.2.0
## [145] FNN_1.1 gbm_2.1.3
## [147] prabclus_2.2-6 iterators_1.0.9
## [149] bit_1.1-14 class_7.3-14
## [151] stringi_1.2.3 blob_1.1.1
## [153] caTools_1.17.1 memoise_1.1.0
## [155] e1071_1.6-8